Lecture 9: Optimization tricks part 1

In this lecture we talk about our first batch of new optimization tricks that can help us overcome common problems encountered when optimizing nonlinear supervised learners like neural networks.

Press the 'Toggle code' button below to show or hide the code throughout this presentation.

In [1]:
from IPython.display import display
from IPython.display import HTML
import IPython.core.display as di # Example: di.display_html('<h3>%s:</h3>' % str, raw=True)

# This line will hide code by default when the notebook is exported as HTML
di.display_html('<script>jQuery(function() {if (jQuery("body.notebook_app").length == 0) { jQuery(".input_area").toggle(); jQuery(".prompt").toggle();}});</script>', raw=True)

# This line will add a button to toggle visibility of code blocks, for use with the HTML export version
di.display_html('''<button onclick="jQuery('.input_area').toggle(); jQuery('.prompt').toggle();">Toggle code</button>''', raw=True)

Normalized Gradient Descent

  • Thus far we have been using gradient descent steps of the form
$$ \mathbf{w}^{\,k} = \mathbf{w}^{\,k-1} - \alpha \nabla g(\mathbf{w}^{\,k-1}) $$
  • More technically speaking, this is called unnormalized gradient descent - why this name?
  • Because when taking steps we care only about the descent direction, not about its magnitude
  • So technically speaking we can instead take steps in the unit-direction given by the negative gradient: i.e., $- \frac{\nabla g(\mathbf{w})}{\left\Vert \nabla g(\mathbf{w}) \right\Vert_2 }$
  • Note that this means we normalize each partial derivative evaluation by the length of the entire gradient
$$ -\frac{\nabla g(\mathbf{w})}{\left\Vert \nabla g(\mathbf{w}) \right\Vert_2 } = \begin{bmatrix} \frac{-\frac{\partial}{\partial w_1}g\left(\mathbf{w}\right)}{\left\Vert \nabla g(\mathbf{w}) \right\Vert_2} \\ \frac{-\frac{\partial}{\partial w_2}g\left(\mathbf{w}\right)}{\left\Vert \nabla g(\mathbf{w}) \right\Vert_2} \\ \vdots \\ \frac{-\frac{\partial}{\partial w_N}g\left(\mathbf{w}\right)}{\left\Vert \nabla g(\mathbf{w}) \right\Vert_2} \end{bmatrix} $$
  • This is the normalized gradient descent step
$$ \mathbf{w}^{\,k} = \mathbf{w}^{\,k-1} - \alpha\frac{\nabla g(\mathbf{w}^{k-1})}{\left\Vert \nabla g(\mathbf{w}^{k-1}) \right\Vert_2 } $$
  • Our original step is called unnormalized gradient descent because we do not normalize the negative gradient direction
$$ \mathbf{w}^{\,k} = \mathbf{w}^{\,k-1} - \alpha \nabla g(\mathbf{w}^{\,k-1}) $$
  • What's the difference between them?
  • In terms of theory (e.g., convergence guarantees): nothing
  • After all, setting $\alpha \longleftarrow \frac{\alpha}{\left\Vert \nabla g(\mathbf{w}^{\,k-1}) \right\Vert_2}$ turns the unnormalized gradient descent step $\mathbf{w}^{\,k} = \mathbf{w}^{\,k-1} - \alpha \nabla g\left(\mathbf{w}^{\,k-1}\right)$ into a normalized one.
  • However, in terms of practical engineering usage: quite a bit! Particularly when using a fixed steplength parameter $\alpha$, as we have been doing.
  • First, notice the distance traveled at an unnormalized gradient descent step $\mathbf{w}^{\,k} = \mathbf{w}^{\,k-1} - \alpha \nabla g(\mathbf{w}^{\,k-1})$
$$ \left\Vert \mathbf{w}^{\,k} - \mathbf{w}^{\,k-1} \right\Vert_2 = \left\Vert -\alpha \nabla g\left(\mathbf{w}^{\,k-1}\right)\right\Vert_2 = \alpha\left\Vert \nabla g\left(\mathbf{w}^{\,k-1}\right) \right\Vert_2 $$
  • That is, the distance traveled is equal to the steplength parameter $\alpha$ times the length of the gradient at $\mathbf{w}^{k-1}$
  • Hence the parameter $\alpha$ only partially controls the length of each step; the gradient adaptively changes the distance we travel at each step
  • With convex functions: this forces initial steps to be quite large, but we then slow down dramatically near a minimum since the gradient is vanishing there
  • This behavior is fine unless our convex function has long narrow valleys leading to its global minima
  • With non-convex functions: this will force gradient descent to halt near any stationary point, including e.g., saddle points
  • This behavior is highly undesirable - we want to find minima!
  • Now notice the length of each step with the normalized form of gradient descent $\mathbf{w}^{\,k} = \mathbf{w}^{\,k-1} - \alpha\frac{\nabla g(\mathbf{w}^{k-1})}{\left\Vert \nabla g(\mathbf{w}^{k-1}) \right\Vert_2 }$
  • The general length of the $k^{th}$ step is equal to
$$ \left\Vert \mathbf{w}^{\,k} - \mathbf{w}^{\,k-1} \right\Vert_2 = \left\Vert -\alpha \frac{\nabla g(\mathbf{w}^{\,k-1})}{\left\Vert \nabla g(\mathbf{w}^{\,k-1}) \right\Vert_2 }\right\Vert_2 = \alpha \left\Vert \frac{\nabla g(\mathbf{w}^{\,k-1})}{\left\Vert \nabla g(\mathbf{w}^{\,k-1}) \right\Vert_2 }\right\Vert_2 = \alpha $$
  • It is given exactly by the steplength parameter $\alpha$, and does not depend on the length of the gradient
  • A fixed value of $\alpha$ can be used, but so can a diminishing steplength like $\alpha = \frac{1}{k}$, which ensures convergence
  • Hence the normalized descent step $\mathbf{w}^{\,k} = \mathbf{w}^{\,k-1} - \alpha\frac{\nabla g(\mathbf{w}^{k-1})}{\left\Vert \nabla g(\mathbf{w}^{k-1}) \right\Vert_2 }$ will not suffer from
    - Slowing down on a convex function (or any long narrow valley) due to a vanishing gradient
    - Getting caught at a non-minimum stationary point (e.g., a saddle point) of a non-convex function
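The contrast above can be checked in plain NumPy. Below is a minimal sketch (the function names and the test quadratic are our own, not from the course's optlib demos): both descent versions share one loop, and we verify that every normalized step travels exactly $\alpha$, while the unnormalized step length scales with the gradient's magnitude.

```python
import numpy as np

def gradient_descent(grad, w, alpha=0.05, max_its=50, normalized=False, eps=1e-10):
    """Run unnormalized or normalized gradient descent; return the weight history."""
    history = [w]
    for _ in range(max_its):
        g = grad(w)
        if normalized:
            g = g / (np.linalg.norm(g) + eps)  # unit-length descent direction
        w = w - alpha * g
        history.append(w)
    return history

# toy convex function with a long narrow valley: g(w) = w_1^2 + 12 w_2^2
grad = lambda w: np.array([2.0 * w[0], 24.0 * w[1]])
w0 = np.array([3.0, 1.0])

hist_u = gradient_descent(grad, w0)                   # unnormalized
hist_n = gradient_descent(grad, w0, normalized=True)  # normalized

# every normalized step travels exactly alpha (up to the tiny eps guard)
steps = [np.linalg.norm(a - b) for a, b in zip(hist_n[1:], hist_n[:-1])]
print(np.allclose(steps, 0.05, atol=1e-4))  # True
```

By contrast, the first unnormalized step travels $\alpha\left\Vert \nabla g(\mathbf{w}^{\,0}) \right\Vert_2$, exactly as derived above.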

Examples applying normalized gradient descent

  • Helps blast through saddle points of non-convex functions
In [52]:
# what function should we play with?  Defined in the next line.
import numpy as np
g = lambda w: np.maximum(0,(3*w - 2.3)**3 + 1)**2 + np.maximum(0, (-3*w + 0.7)**3 + 1)**2

# run the visualizer for our chosen input function, initial point, and steplength alpha
# (optlib is the course demo library, assumed imported in an earlier cell)
demo = optlib.gradient_descent_demos.visualizer()

demo.animate_2d(g=g, w_init = 0,steplength = 0.01,max_its = 55,version = 'normalized',wmin = 0,wmax = 1)
Out[52]:



In [3]:
# what function should we play with?  Defined in the next line.
g = lambda w: np.maximum(0,(3*w - 2.3)**3 + 1)**2 + np.maximum(0, (-3*w + 0.7)**3 + 1)**2

# run the visualizer for our chosen input function, initial point, and step length alpha
demo = optlib.gradient_descent_demos.visualizer();

demo.compare_versions_2d(g=g, w_init = 0,steplength = 0.01,max_its = 80,version = 'normalized',wmin = 0,wmax = 1)
  • Helps get down long narrow valleys leading to a minimum - even on convex functions
  • For example, fitting poly features to a noisy sinusoid dataset
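The cells below rely on weight histories computed earlier in the notebook. As a self-contained stand-in (the dataset, polynomial degree, steplength, and iteration count are all our own assumptions), here is a hedged sketch that fits polynomial features to a noisy sinusoid with both descent versions:

```python
import numpy as np

# stand-in noisy sinusoid dataset (seed, size, and noise level are our assumptions)
np.random.seed(0)
x = np.linspace(0, 1, 30)
y = np.sin(2 * np.pi * x) + 0.1 * np.random.randn(30)

degree = 5
features = lambda v: np.array([v**d for d in range(degree + 1)])  # polynomial features

def least_squares(w):
    """Mean squared error of the polynomial fit with weights w."""
    return np.mean([(features(xi) @ w - yi)**2 for xi, yi in zip(x, y)])

def grad(w):
    """Gradient of the mean squared error above."""
    return np.mean([2 * (features(xi) @ w - yi) * features(xi)
                    for xi, yi in zip(x, y)], axis=0)

def descend(normalized, alpha=0.1, max_its=200, eps=1e-10):
    w = np.zeros(degree + 1)
    history = [w]
    for _ in range(max_its):
        d = grad(w)
        if normalized:
            d = d / (np.linalg.norm(d) + eps)
        w = w - alpha * d
        history.append(w)
    return history

weight_history_1 = descend(normalized=False)  # unnormalized run
weight_history_2 = descend(normalized=True)   # normalized run
print(least_squares(weight_history_1[-1]), least_squares(weight_history_2[-1]))
```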
In [32]:
# plot cost function histories
histories = [weight_history_1,weight_history_2]
compare_regression_histories(histories,least_squares)
In [33]:
# create figure and plot data
fig, ax = plt.subplots(1, 1, figsize=(6,3))
ax.scatter(x,y,color = 'k',edgecolor = 'w')

# fit regression model to data
w1 = weight_history_1[-1]
w2 = weight_history_2[-1]

# make fits
x_vals = np.linspace(min(x),max(x),200)
y_vals_1 = [predict(v,w1) for v in x_vals]
y_vals_2 = [predict(v,w2) for v in x_vals]

# plot it
ax.plot(x_vals,y_vals_1,color = 'blue')
ax.plot(x_vals,y_vals_2,color = 'orange')

plt.show()

Momentum

  • A slight modification of gradient descent (either version) designed to deal with long narrow valleys, where gradient descent zig-zags back and forth
  • Such valleys occur in both convex and non-convex cost functions
  • Basic idea: add a weighted difference of the previous weight updates to even out the zig-zagging
  • e.g., for the unnormalized version of gradient descent, with $\beta \in (0,1)$
\begin{equation} \mathbf{w}^{k+1} = \mathbf{w}^k - \alpha \nabla g\left(\mathbf{w}^k\right) + \beta \left(\mathbf{w}^{k} - \mathbf{w}^{k-1}\right) \end{equation}

  • Works the same for normalized gradient descent as well
  • Momentum is often written in an equivalent but more mysterious manner (see notes for details)
\begin{equation} \begin{array}{l} \mathbf{z}^{k+1} = \beta\,\mathbf{z}^{k} + \nabla g\left(\mathbf{w}^k\right) \\ \mathbf{w}^{k+1} = \mathbf{w}^{k} - \alpha \, \mathbf{z}^{k+1} \end{array} \end{equation}
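The two forms can be checked against each other numerically. A minimal sketch (assuming unnormalized steps and $\mathbf{w}^{0} = \mathbf{w}^{-1}$, $\mathbf{z}^{0} = \mathbf{0}$; the function names and test quadratic are ours): both recursions generate identical iterates.

```python
import numpy as np

def momentum_form1(grad, w, alpha, beta, max_its):
    """w^{k+1} = w^k - alpha*grad(w^k) + beta*(w^k - w^{k-1})."""
    w_prev = w                      # convention: w^{-1} = w^0
    history = [w]
    for _ in range(max_its):
        w_next = w - alpha * grad(w) + beta * (w - w_prev)
        w_prev, w = w, w_next
        history.append(w)
    return history

def momentum_form2(grad, w, alpha, beta, max_its):
    """z^{k+1} = beta*z^k + grad(w^k);  w^{k+1} = w^k - alpha*z^{k+1}."""
    z = np.zeros_like(w)            # convention: z^0 = 0
    history = [w]
    for _ in range(max_its):
        z = beta * z + grad(w)
        w = w - alpha * z
        history.append(w)
    return history

# narrow-valley quadratic: g(w) = w_1^2 + 12 w_2^2
grad = lambda w: np.array([2.0 * w[0], 24.0 * w[1]])
w0 = np.array([3.0, 1.0])

h1 = momentum_form1(grad, w0, alpha=0.05, beta=0.3, max_its=40)
h2 = momentum_form2(grad, w0, alpha=0.05, beta=0.3, max_its=40)
print(np.allclose(h1, h2))  # True: the two forms generate identical iterates
```

The equivalence follows since $\mathbf{w}^{k} - \mathbf{w}^{k-1} = -\alpha\,\mathbf{z}^{k}$ at every step.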

Momentum examples

  • For example, take the quadratic
\begin{equation} g(\mathbf{w}) = a + \mathbf{b}^T\mathbf{w} + \mathbf{w}^T\mathbf{C}\mathbf{w} \end{equation}

where $a = 0$, $\mathbf{b} = \begin{bmatrix} 1 \\ 1 \end{bmatrix}$, and $\mathbf{C} = \begin{bmatrix} 1 & 0 \\ 0 & 12 \end{bmatrix}$

  • Use unnormalized descent, $\alpha = 0.1$, and $\beta = 0.3$
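As a self-contained stand-in for the demo below (whose weight histories were computed in earlier notebook cells), here is a hedged sketch: the gradient of the quadratic is $\nabla g(\mathbf{w}) = \mathbf{b} + (\mathbf{C} + \mathbf{C}^T)\mathbf{w}$, and we compare a plain run ($\beta = 0$) against a momentum run ($\beta = 0.3$); the initial point and iteration count are our own assumptions.

```python
import numpy as np

# the quadratic from the slide: g(w) = a + b^T w + w^T C w
a = 0.0
b = np.array([1.0, 1.0])
C = np.array([[1.0, 0.0], [0.0, 12.0]])

g    = lambda w: a + b @ w + w @ C @ w
grad = lambda w: b + (C + C.T) @ w   # gradient of the quadratic above

def momentum_run(beta, w=None, alpha=0.1, max_its=25):
    """Unnormalized gradient descent with momentum; beta=0 recovers plain descent."""
    w = np.array([2.0, 2.0]) if w is None else w   # assumed initial point
    w_prev = w
    for _ in range(max_its):
        w_next = w - alpha * grad(w) + beta * (w - w_prev)
        w_prev, w = w, w_next
    return w

w_star = np.linalg.solve(C + C.T, -b)   # minimizer, where the gradient vanishes

plain    = momentum_run(beta=0.0)   # zig-zags badly along the steep w_2 direction
momentum = momentum_run(beta=0.3)   # the beta term damps the zig-zag
print(np.linalg.norm(momentum - w_star) < np.linalg.norm(plain - w_star))  # True
```

With this steplength the plain run oscillates with growing amplitude in the steep $w_2$ direction, while the momentum run settles near the minimizer.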
In [41]:
# visualize
import contour_run_comparison
demo = contour_run_comparison.Visualizer()
demo.show_paths(g, weight_history_1,weight_history_4,num_contours = 20)
  • For example, take the quadratic
\begin{equation} g(\mathbf{w}) = a + \mathbf{b}^T\mathbf{w} + \mathbf{w}^T\mathbf{C}\mathbf{w} \end{equation}

where $a = 0$, $\mathbf{b} = \begin{bmatrix} 1 \\ 1 \end{bmatrix}$, and $\mathbf{C} = \begin{bmatrix} 0 & 0 \\ 0 & 12 \end{bmatrix}$

In [44]:
# visualize
demo = contour_run_comparison.Visualizer()
demo.show_paths(g, weight_history_1,weight_history_2,num_contours = 20)
  • Two-class classification
In [63]:
import classification_2d_demos_v2

# create an instance of the logistic regression demo and load in data, cost function, and descent history
demo3 = classification_2d_demos_v2.Visualizer(data,tanh_least_squares)

# animate descent process
demo3.animate_runs(weight_history_1,weight_history_4,num_contours = 25)
Out[63]: